This notebook uses the full list of interactions between Entrez IDs extracted from the DIP dataset to build a training set with both positive and negative examples. The output is two tables. One holds the positive interactions, in the form:

Protein 1 | Protein 2 | Interacting? |
---|---|---|
12345 | 54321 | 1 |
... | ... | 1 |

The other holds the negative examples, in the same form:

Protein 1 | Protein 2 | Interacting? |
---|---|---|
12435 | 43521 | 0 |
... | ... | 0 |
In [1]:
cd /home/gavin/Documents/MRes/DIP/human/
In [2]:
ls
Basically, this boils down to taking the flat list of Entrez IDs and using all pairwise combinations of them as the negative training set. The combinations that are known to interact, that is, those in the positive training set, have to be removed from the negative set. We may also want to sample a subset of the remaining combinations to reflect our belief that the DIP does not contain every interaction for the proteins that it does know about.
To find the combinations of these proteins we can use the itertools module from the Python standard library:
In [3]:
import itertools
import csv
In [4]:
#load in the flattened list of protein IDs
#(chain.from_iterable joins the rows of the csv file into one flat list)
ids = list(itertools.chain.from_iterable(csv.reader(open("flat.Entrez.txt"))))
In [5]:
#find all combinations of Entrez protein IDs
negids = list(itertools.combinations(ids, 2))
In [6]:
#examples
print negids[0:10]
In [7]:
#how many combinations are there?
print "Number of combinations of the full human DIP Entrez protein list is %i."%len(negids)
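Because the number of pairs grows quadratically with the number of proteins, it can help to know the size of the list before building it. The following is a minimal sketch, assuming only that the IDs are held in a plain list: `itertools.combinations(ids, 2)` yields n(n-1)/2 tuples for a list of length n, so the count follows from the length alone. The helper name `n_pairs` and the toy IDs are made up for illustration and are not part of the original notebook.

```python
import itertools

def n_pairs(n):
    """Number of unordered pairs that can be drawn from n items (n choose 2)."""
    return n * (n - 1) // 2

# toy IDs for illustration only, not real DIP entries
toy_ids = ["101", "102", "103", "104"]

# the formula agrees with the length of the materialised list of pairs
assert n_pairs(len(toy_ids)) == len(list(itertools.combinations(toy_ids, 2)))
print(n_pairs(len(toy_ids)))
```

In the notebook itself, `n_pairs(len(ids))` should give the same figure as `len(negids)` without holding every pair in memory.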
In [8]:
#first load in the positive examples:
posids = list(csv.reader(open("interacting.Entrez.txt"), delimiter="\t"))
print posids[0:10]
print "Number of positive examples: %i"%len(posids)
#remove entries that contain self-interactions:
posids = [(x,y) for x,y in posids if x != y]
The next part is a bit computationally intensive because the number of combinations is so large. Luckily, there is a Stack Overflow post about exactly this problem. It turns out that Python has a set type whose difference operation does exactly this kind of removal.
In [9]:
#then we can remove all the positive entries from the negative list using the set type
posids = set(posids)
negids = set(negids)
negids = negids - posids
Unfortunately, this operation only removes tuples that match exactly between the two sets. Since the order of the proteins within a pair also has to match, it fails to remove pairs whose order is reversed. Luckily, we can hack our way round this by reversing all the elements of posids and repeating the subtraction:
In [10]:
rposids = set([(y,x) for x,y in posids])
In [11]:
negids = negids - rposids
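An alternative to subtracting the forward and reversed pairs in two passes is to put every pair into one canonical order before comparing. The following is a sketch of that idea on toy data, not the method used in this notebook. The helper name `canonical` and the toy IDs are assumptions made for illustration. IDs are compared as strings, which is enough to give a consistent order.

```python
import itertools

def canonical(pair):
    """Return the pair with its two IDs in sorted order, so (a, b) and (b, a) match."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

# toy data for illustration only, not real DIP entries
toy_ids = ["101", "102", "103"]
toy_pos = [("102", "101"), ("103", "103")]  # one reversed pair, one self-interaction

# drop self-interactions, then canonicalise both sides before the set difference
pos_set = set(canonical(p) for p in toy_pos if p[0] != p[1])
neg_set = set(canonical(p) for p in itertools.combinations(toy_ids, 2)) - pos_set
print(sorted(neg_set))
```

With both sets in canonical order a single set difference removes a known interaction whichever way round it was recorded.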
How many negative interactions should there be for each positive interaction? The number used by Qi was 600. We use that here, and define it as a variable so that it can be changed easily if required:
In [12]:
negtoposratio = 600
negN = negtoposratio*len(posids)
print "The number of positive examples is %i, therefore we require %i negative examples which can be sampled from the %i combinations available."%(len(posids), negN, len(negids))
Sampling directly from a large set is possible, but it would require rewriting the set type; there is a Stack Overflow post on this topic. It is worth trying simpler, less efficient methods first to see if they run fast enough. The simplest way is to convert the set to a list, shuffle it, and slice off as many samples as we want:
In [13]:
from random import shuffle
negids = list(negids)
shuffle(negids)
In [14]:
#extract negN samples from this for our training set
negexamples = negids[0:negN]
In [15]:
print negexamples[0:10]
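One thing to watch with the slice is that `negids[0:negN]` quietly returns fewer than `negN` items if fewer pairs are available. The following is a minimal sketch of an alternative that draws the sample with `random.sample` and makes that cap explicit. The function name `sample_negatives`, the `seed` parameter, and the toy pairs are hypothetical and not part of the original notebook.

```python
import random

def sample_negatives(candidates, n, seed=None):
    """Draw up to n pairs without replacement from a list of candidate pairs."""
    rng = random.Random(seed)    # a fixed seed makes the draw repeatable
    n = min(n, len(candidates))  # cannot draw more pairs than exist
    return rng.sample(candidates, n)

# toy candidates for illustration only, not real DIP entries
toy_neg = [("101", "103"), ("102", "103"), ("101", "104"), ("102", "104")]
picked = sample_negatives(toy_neg, 3, seed=0)
print(len(picked))
```

Passing a seed is a design choice to make the training set reproducible between runs; leaving it as `None` gives a different sample each time, as the shuffle does.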
In [16]:
with open("training.positive.Entrez.txt", "w") as f:
    csv.writer(f, delimiter="\t").writerows((x, y, 1) for x, y in posids)
In [17]:
with open("training.negative.Entrez.txt", "w") as f:
    csv.writer(f, delimiter="\t").writerows((x, y, 0) for x, y in negexamples)